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When data are sparse and/or predictors multicollinear, current implementation of sparse 
partial least squares (SPLS) does not give estimates for non-selected predictors nor 
provide a measure of inference. In response, an approach termed "all-possible" SPLS is 
proposed, which fits a SPLS model for all tuning parameter values across a set grid. Noted 
is the percentage of time a given predictor is chosen, as well as the average non-zero 
parameter estimate. Using a "large" number of multicollinear predictors, simulation 
confirmed variables not associated with the outcome were least likely to be chosen as 
sparsity increased across the grid of tuning parameters, while the opposite was true for 
those strongly associated. Lastly, variables with a weak association were chosen more 
often than those with no association, but less often than those with a strong relationship 
to the outcome. Similarly, predictors most strongly related to the outcome had the largest 
average parameter estimate magnitude, followed by those with a weak relationship, 
followed by those with no relationship. Across two independent studies regarding the 
relationship between volumetric MRI measures and a cognitive test score, this method 
confirmed a priori hypotheses about which brain regions would be selected most often 
and have the largest average parameter estimates. In conclusion, the percentage of time 
a predictor is chosen is a useful measure for ordering the strength of the relationship 
between the independent and dependent variables, serving as a form of inference. The 
average parameter estimates give further insight regarding the direction and strength of 
association. As a result, all-possible SPLS gives more information than the dichotomous 
output of traditional SPLS, making it useful when undertaking data exploration and 
hypothesis generation for a large number of potential predictors. 
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INTRODUCTION 

In fields such as neuroscience, chemometrics, and genetics, data is 
often collected on a large number of variables but with a relatively 
small sample size, and predictors may also be highly coUinear. 
Statistical methods used in this setting include regression mod- 
els, cluster analysis and/or tree-based methods, ridge regression 
and dimension-reduction techniques such as partial least squares 
(PLS). However, when variable selection is the goal, these may 
prove inadequate or difficult to interpret. 

In the realm of ordinary least squares (OLS), multicollinear- 
ity affects both the stability of the estimated coefficients (Wold 
et al, 1984) and inference on these estimates (Farrar and Glauber, 
1 967). Essentially, model prediction ability is poor when estimates 
are unstable (Wold et al., 1984), and one cannot trust conclusions 
drawn from test statistics, p-values or confidence intervals due to 



artificially inflated standard errors (Farrar and Glauber, 1967). 
As an alternative to OLS, ridge regression (Hoerl and Kennard, 
2000; McDonald, 2009) and PLS account for multicollinearity 
and/or over-fitting. However, they are not intended for variable 
selection without additional computation such as bootstrapping 
(Abdi, 2010). 

In PLS, latent variables (linear combinations of the predictors) 
are formed using both the outcome(s) and predictors such that all 
pairs of latent variables are orthogonal and have a sample corre- 
lation of zero (Garthwaite, 1994). Regression models are then fit 
using these latent variables rather than the original predictors and 
multicollinearity is no longer a concern. In addition, the number 
of latent variables is often smaller than the number of predic- 
tors, so that PLS reduces the dimensionality of the data and the 
likelihood of over-fitting. However, all predictors are assigned a 
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non-zero weight and inference is not provided, so that variable 
selection is not readily achieved (Tobias, 1997; Chun and Kele§, 
2010). Further detail on the theory underlying PLS regression is 
available elsewhere (Garthwaite, 1994; Wold et al, 2001; Krishnan 
etal, 2011). 

Given standard PLS is not intended for variable selection but 
rather prediction, sparse methods such as sparse partial least 
squares (SPLS) were developed. Variable selection is accom- 
plished by using tuning parameters in the modeling process, 
which drive both the latent variable selection and computation of 
predictors' weights (Chun and Kele§, 2010). Here, estimates may 
be set to zero, indicating a predictor is not significantly associated 
with the outcome. 

Although some weights are zero so as to provide variable selec- 
tion, this can also be viewed as a weakness of SPLS. In data 
exploration and hypothesis generation, effect size and p-values, 
despite insignificance, are often of interest. During exploratory 
analyses, one may wish to increase the type-I error rate and 
allow variables that would otherwise be borderline significant or 
insignificant into the set of selected predictors. Also, one may wish 
to compare standardized estimates of various predictors despite 
insignificance. None of this information is provided by executing 
SPLS in its traditional manner. 

To address these shortcomings, an alternative approach, 
referred to here as "aU-possible" SPLS, is proposed. Briefly, a 
SPLS model is fit for "all possible" values of the model's tun- 
ing parameters, as opposed to fitting only one model based on 
the "optimal" parameters (this latter approach will be referred 
to as "traditional" SPLS). Predictors are ranked by the percent- 
age of time they are chosen across all models, and the average of 
non-zero standardized parameter estimates is given for all pre- 
dictors, even those not chosen by traditional SPLS. Although 
not formal inference such as a p-value, the former gives the 
relative ranking of predictors, allowing one to identify poten- 
tially borderline significant variables, as well as those least likely 
to be predictive of the outcome. Simulation confirms predic- 
tors most strongly associated with the outcome are robust to 
changes in the tuning parameters and continue to be selected 
as sparsity increases, while those with the weakest association 
are less likely to be chosen under high levels of sparsity. This 
approach yields supplementary information lost in the tradi- 
tional application of SPLS, providing increased insight into one's 
data. 

METHODS 
TRADITIONAL SPLS 

The spls package (version 2.1-0) in R (version 2.13.2) based on the 
theory presented by Chun and Kele§ (2010) is considered here. 
The algorithm requires the specification of two tuning parame- 
ters, K and rj. K (an integer between 1 and min{p, (v - l)«/v}, 
where v is the number of folds for the cross-validation (CV), 
p is the number of predictors and n is the sample size (Chung 
et al., 2009)) is the number of latent variables and rj (a continu- 
ous value on the interval [0, 1)) determines the amount of sparsity 
in the algorithm. In general, lower values of r] represent less spar- 
sity (and thus more variables tend to be selected), whereas higher 
values imply more sparsity. However, the choice of K also affects 



variable selection in conjunction with rj (lower values of K tend 
to result in fewer chosen variables). 

To facilitate the choice of K and rj, the package includes a CV 
function, where the "optimal" K and rj are those with the lowest 
mean squared prediction error. For the purposes of this paper, 
"traditional" SPLS refers to the use of this CV to choose one 
pair of "optimal" tuning parameters. Once determined, the SPLS 
model is fit and selected predictors are noted. 

While using traditional SPLS, it was discovered the selection 
of optimal tuning parameters was affected by the seed if CV 
other than leave-one-out (LOO) was used. For example, for 1000 
randomly-chosen seeds, the optimal values of the tuning param- 
eters chosen most often by a 10-fold CV in the real data used in 
Section Data Application: Volumetric MRI Regions as Predictors 
of Cognitive Test Results were K = 2, rj = 0.7. However, they 
were only chosen for 171 seeds out of 1000 — about 17% of the 
time. The next pair chosen most often was K = 3,ri = 0.6, at 106 
times. All of the remaining pairings were chosen less than 10% of 
the time, so that no one pair was selected notably more than the 
others. Note that if K and/or ij differ only by one unit, this can 
mean the addition or exclusion of one or more variables from the 
results. Here, eight predictors were chosen by the first set of tun- 
ing parameters, whereas 17 were chosen by the second, indicating 
instability in the tuning parameter values can cause instability in 
the variable selection process, affecting conclusions. Because of 
the unreliability of the 10-fold CV with these data, LOO CV is 
recommended for traditional SPLS. 

Another consideration with the CV is how fine of a grid to 
use when searching for the optimal value of since, again, it is 
continuous. In the examples provided by the authors of the spls 
package, rj maybe one of 0.1, 0.2, 0.3, . . . , 0.9 (Chung et al., 2009; 
Chun and Kele§, 2010). Given this, and also the fact that consid- 
ering more -values results in significantly more computational 
time, (^-values of 0.1, 0.2, 0.3, . . . , 0.9 were used in this paper as 
well. 

"ALL-POSSIBLE" SPLS 

"All-possible" is quoted because, given r] is continuous, one 
cannot actually achieve every possible combination of tuning 
parameters. Given a discrete subset of rj (here, {0.1, 0.2, . . . , 0.9}), 
however, one considers "all possible" combinations of the param- 
eters. Specifically, there wiU heK x rj total models fit, one for each 
combination of K and with standardized estimates recorded 
in each instance. The results are the percentage of time chosen 
(i.e., the parameter estimate was non-zero), as well as the average 
non-zero standardized parameter estimate. 

It should be noted that with this method it is expected all pre- 
dictors will be chosen a reasonable number of times (usually in 
at least 70% of the models). This is because once a large enough 
K- and/or small enough -value is used, the method no longer 
induces enough sparsity to allow for variable selection — it essen- 
tially acts like PLS and chooses all variables. Since all pairings of K 
and rj were considered here, many of them resulted in all variables 
being selected. 

There are two advantages to all-possible SPLS. First, by rank- 
ing the variables based on how often they are chosen across all 
models, one has a relative way to compare them, as opposed to 



Frontiers in Neuroinformatics 



www.frontiersin.org 



March 2014 | Volume 8 | Article 18 | 2 



Olson Hunt et al. 



Sparse partial least squares selection 



"chosen" or "not chosen." Specifically, one can see those variables 
selected most and least fi-equently, as well as those that were 
somewhere in between. In this way, one obtains a continuum of 
information instead of a dichotomy. Second, an effect size for all 
predictors — not just those chosen by traditional SPLS — is pro- 
vided. Thus, even if a predictor was only selected 75% of the time, 
one still has information on its estimate whenever it was selected. 

SIMULATION 
SIMULATION STRUCTURE 

A design analogous to that in Chun and Kele§ (2010) was used 
to create coUinear predictors of varying association with the 
outcome — one set of predictors was strongly associated, another 
weakly and a third not at all. For j = 1,2,3 and _ i + 1 < i < Cj, 
where (co, ci, C2, cj,) = (0, 7, 17, 27), predictors were of the form 
Xj = my + gj. Given a sample size of n = 100, my were each vec- 
tors of length 100 from N{0, 20Iioo) and e; ~N'(0, Iioo). Lastly, 
y = 2mi — 0.2m2 -|- x, where x ~JV(0, Iioo)- AH variables were 
standardized while other settings for the SPLS function were kept 
at default. 

PREDICTORS WITH WEAKER ASSOCIATION ARE LESS LIKELY TO BE 
CHOSEN WITH INCREASED SPARSITY 

This simulation demonstrated how predictors with varying levels 
of association with y are affected by changes in the tuning param- 
eter pair, {K, rj). The general pattern is that for lower values of K 
and higher values of r], sparsity increases and fewer variables are 
selected. Here, -K" = {1, . . . , 27} and again rj = {0.1, . . . , 0.9}. 

Consider three sets of predictors: Si = {xi, . . . , X7} (strongly 
associated with y), 82 = {xg, . . ., X17} (weakly associated) and 
S3 = {xi8, . . . , X27} (not associated). For each d = I, D = 
1000 samples drawn randomly from the distribution as outlined 
in Section Simulation Structure, a SPLS model was run for all 
pairs of K and rj. The percentage of predictors chosen from each 



set was noted for each pair and the average across all 1000 data sets 
is shown in Figures 1A,B for S2 and S3. Note that K only ranges 
from 1 to 15, as after K = 15, the average was 100% for all pairs of 
tuning parameters. For Si , all seven predictors were always chosen 
(i.e., the average was always 100%). 

These results confirm variables in set S3 (not associated with y) 
were less likely to be chosen as K decreased and rj increased (i.e., 
sparsity increased). Variables in S2 showed a similar pattern due 
to their weak association, although their rate of selection was 
notably higher than those in S3. The fact that all variables in 
Si were chosen for 100% of the (K, rj) pairs across all D data 
sets shows strongly associated variables are robust to changes in 
the tuning parameters. Subsequently, calculating the percentage 
of time a variable is selected over all pairs of tuning parameters 
(i.e., conducting all-possible SPLS) will result in those with the 
strongest association having the highest percentage of time cho- 
sen, while the opposite will be true for those with the weakest. 
This is shown via simulation in the next section. 

PERCENTAGE OF TIME CHOSEN AND AVERAGE NON-ZERO 
STANDARDIZED ESTIMATES 

For each of d = 1, . . . , D = 1000 samples from the distribution 
as described in Section Simulation Structure, all-possible SPLS 
was conducted: For a given data set, a SPLS model was run for all 
pairs ofK = {I, . . . , 27} and rj = {0.1, . . . , 0.9}. Recorded was the 
percentage of time each variable was chosen, as well as the mean 
non-zero standardized parameter estimates. Table 1 reports the 
average of these percentages and mean estimates across all 1000 
samples, in order to assess the method's behavior in the long run. 

The average percentage of time chosen for all predictors in Si 
was 100, while those in S2 and S3 were all chosen around 96% 
and 90% of the time on average, respectively, resulting in three 
distinct groups. The average mean non-zero standardized esti- 
mates for those in Si were all around 0.15, while those in S2 were 




FIGURE 1 I (A) shows the average percentage of variables in S2 selected for each pair of tuning parameters across D = 1000 simulated data sets, while (B) 
shows this for S3. 
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Table 1 | From all-possible SPLS conducted on D = 1000 samples 
from the same distribution. 



Predictor 


Average percentage 


Average of mean 




of time chosen 


non-zero standardized p 


Si xi 


100 


0.144 


X2 


100 


0.147 


X3 


100 


0.145 


X4 


100 


0.142 


X6 


100 


0.144 


X6 


100 


0.147 


X7 


100 


0.145 


So Xn 


96.5 


-0.011 


" 1 1 


96.482 


-0.011 


Xl7 


96.476 


-0.010 


X8 


96.474 


-0.011 


X10 


96.472 


-0.011 


X15 


96.466 


-0.012 


X16 


96.465 


-0.010 


X14 


96.460 


-0.007 


X14 


96.453 


-0.009 


X9 


96.451 


-0.011 




89.912 


0.0011 




89.886 


-0.0003 


X24 


89.851 


-0.0051 


"18 


89.839 


-0.0026 


X27 


89.786 


0.0021 


X23 


89.778 


0.0038 


X26 


89.770 


0.0007 


X22 


89.734 


-0.0023 


X26 


89.730 


0.0001 


Xl9 


89.710 


0.0035 



Average percentage of time a variable was chosen and its average mean non- 
zero standardized parameter estimate across D data sets. Variables are ordered 
by average percentage. 



about —0.01, and those in S3 were always smaller than those in 
S2 (and Si). Both the magnitudes and directions of the estimates 
for Si and S2 were as expected given the structure of the data out- 
lined in Section Simulation Structure and the fact that estimates 
were standardized. The small magnitudes and varying directions 
of predictors in S3 were reasonable, as they should have estimates 
that hover around zero. 

DATA APPLICATION: VOLUMETRIC MRI REGIONS AS 
PREDICTORS OF COGNITIVE TEST RESULTS 

In neuroimaging, brain regions tend to be numerous and highly 
correlated, so that over-fitting and multicoUinearity are of con- 
cern. Here, a well-established predictor-outcome relationship is 
used to illustrate the proposed SPLS method. 

DATA COLLECTION 

Participants 

Data were obtained from the Cardiovascular Health Study (CHS), 
which is an ongoing, population-based, longitudinal study, and 



the Healthy Brain Project (HBP), a sub-study of the Health, 
Aging and Body Composition (Health ABC) Study, which is also 
longitudinal and population-based. 

The CHS is a study of coronary heart disease and stroke risk 
in older adults. Briefly, 5888 community-dwelling older adults 
were identified between 1987 and 1993 from Medicare eligibil- 
ity lists in four clinical centers (Forsyth County, NC; Sacramento 
County, CA; Washington County, MD and Pittsburgh, PA) (Fried 
and Borhaiii, 1991). Participants were recruited if they were age 
65 or older at time of recruitment, non-institutionalized, not 
wheelchair-bound or undergoing active cancer treatment, able to 
give informed consent and expected to remain in the area for 
at least 3 years. The participants had annual clinic examinations 
through 1998-1999. 

Brain MRIs were acquired for 523 participants in Pittsburgh 
in 1997-1999 (Lopez et al., 2003). Compared to the participants 
who did not have a brain MRI, these participants were younger, 
more likely to have more years of education and had a lower 
prevalence of cardiovascular diseases and cerebrovascular find- 
ings (Rosano et al, 2006, 2007a). In 2003-2004, a random sample 
of 327 brain MRIs from the 523 were re-read (Rosano et al., 
2005, 2007a,b, 2008). No significant differences were observed 
with regard to demographics or health-related factors between 
these 327 participants and the 523 total subjects. 

The Health ABC study began in 1997-1998 as a longitudi- 
nal, observational cohort study of 3075 well-functioning older 
adults from Pittsburgh, PA and Memphis, TN (Simonsick et al., 
2001). Participants were enrolled if they were 70-79 years old and 
reported no difficulty walking a quarter of a mile (400 meters), 
climbing 10 steps or performing activities of daily living; were free 
of life-threatening cancers with no active treatment within the 
prior 3 years and had planned to remain within the study area for 
at least 3 years. In 2006-2007, 314 Health ABC participants from 
the Pittsburgh site who were interested in and eligible for a brain 
3T MRI received a MRI in addition to in-person Health ABC 
assessments. This ancillary study of the Health ABC is referred 
to as the HBP 

Both studies have been approved by the institutional review 
boards of the University of Pittsburgh. 

Magnetic Resonance Imaging (MRI) Measures 

In both the CHS and HBP, brain MRI assessments included vol- 
umetric measures of gray matter for both individual regions and 
the whole brain. 

The brain MRI protocol for the CHS carried out in 1997- 
1999 has been described elsewhere (Yue et al., 1997). Briefly, 
sagittal Tl -weighted localizer sequences and axial spin-echo 
spin-density-weighted, spin-echo T2-weighted and Tl-weighted 
images were acquired using a 1.5T scanner. A volumetric Spoiled 
Gradient Recalled Acquisition (SPGR) sequence with parameters 
optimized for maximal contrast among gray matter, white mat- 
ter and cerebrospinal fluid (CSF) was acquired in the coronal 
plane (echo time/repetition time (TE/TR) = 5/25, flip angle = 
40 deg., NEX = 1, slice thickness = 1.5/0 mm interslice). All MRI 
data were interpreted at a central MRI Reading Center using a 
standardized protocol (Bryan et al., 1997; Yue et al., 1997). 

The protocol for the HBP study was performed with a 
Siemens 12-channel head coil and 3T Siemens Tim Trio MR 
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scanner at the Magnetic Resonance Research Center, University 
of Pittsburgh (Venkatraman et al., 2011; Rosano et al., 2012a,b). 
Magnetization-prepared rapid gradient echo Tl-weighted images 
(MPRAGE) were acquired in the axial plane (TO = 2300ms, 
TE = 3.43 ms, imaging time (TI) = 900ms, 9° flip angle, 256 x 
224 mm field of view (FOV), 1 x 1mm voxel size, 256 x 224 
matrix size, 176 slices and 1 mm thick). Fluid-attenuated inver- 
sion recovery (FLAIR) images were acquired in axial plane {TR = 
9160 ms, r£ = 89ms, 77 = 2500 ms, 150° flip angle, 256 x 
212 mm FOV, 256 x 240 matrix size, 48 slices, 3 mm thick and 
1x1 mm voxel size). Diffusion-weighted images were acquired 
using a single short spin-echo sequence {TR = 5300 ms, TE = 
88 ms, TI = 2500 ms, 90° flip angle, 256 x 256 mm FOV, two 
diffusion values of = 0 and 1000 s/mm, 12 diffusion direc- 
tions, four repetitions, 40 slices, 3 mm thick, 128 x 128 matrix 
size, 2x2 mm voxel size and GRAPPA = 2). A neuroradiologist 
examined each MRI for neurologic abnormalities. A radiolo- 
gist verified the presence of abnormalities with potential clini- 
cal relevance. No images were excluded because of unexpected 
findings. 

Voxel counts of the gray matter were obtained for individual 
regions of interest and for the whole brain using a procedure 
previously described (Zhang et al., 2001; Tzourio-Mazoyer et al, 
2002; Rosano et al, 2005; Wu et al, 2006). After skull and scalp 
stripping (Smith, 2002), and after segmentation of gray matter, 
white matter and CSF, the brain atlas and the individual sub- 
ject brain were aligned and intensity normalization was done on 
each subject's structural image (SPGR for the CHS and MPRAGE 
for the HBP images), as well as on the template colin27, to give 
each subject the same orientation and image intensity distribu- 
tion as the template and to improve the registration accuracy. 
For both the CHS and HBP, FMRIB-FAST was applied to seg- 
ment the image into gray matter, white matter and CSF, while 
also correcting for spatial intensity variations such as bias field or 
radio-frequency inhomogeneities (Rosano et al., 2005; Wu et al., 
2006). The registration procedure used a fuUy-deformable auto- 
matic algorithm (Thirion, 1998) that does not warp or stretch the 
individual brain, and thus minimizes measurement inaccuracies 
(Wu et al., 2006). Volumes were converted from number of voxels 
to cubic millimeters. 

DATA DESCRIPTION AND PREPARATION 
Dependent variable 

Scores from the Modified Mini-Mental State Examination (3MS) 
were used as the dependent variable, as it is a highly studied 
outcome with regard to memory. The 3MS is a brief, general 
cognitive battery with components for orientation, concentra- 
tion, language, praxis and immediate and delayed memory (Teng 
and Chui, 1987). Because scores tend to be clustered at the 
high end of the scale, a transformation for left-skewed data was 
used:-ln(101 - 3MS), where 3MS represents the test score for a 
given individual (Shackman et al., 2006). 

Regions of interest and confounding variables 

A tiered hypothesis was formed based on the strength of cur- 
rent findings, with the expectation that primary regions would 
have the strongest association with 3MS, followed by secondary 



regions. A third set of regions referred to as "non-hypothesized" 
were not expected to be associated with the outcome. 

The primary hypothesized regions were the hippocampus, 
parahippocampus and entorhinal cortex (Zola-Morgan and 
Squire, 1993; Dickerson et al, 2001). The secondary hypothesis 
included additional memory-related regions: amygdala, caudate 
and medial parietal, lateral parietal and posterior cingulate cor- 
tices (Packard and Knowlton, 2002; Koivunen et al, 201 1; Squire 
and WLxted, 2011). Lastly, non-hypothesized regions were those 
traditionally related to motor tasks and performance (not mem- 
ory): putamen, pallidum, thalamus, supplementary motor cor- 
tex, cerebellum, and post-central and pre-central gyri (Rosano 
et al, 2007a). Because the pallidum measurements were highly 
skewed right, the natural logarithm of these values was used. 
Regions were not normalized, as total gray matter parenchyma 
was included as a covariate. 

The following variables were included as predictors in all 
models because of prior work indicating an association with 
3MS and/or brain structure (Brickman et al., 2008; Raji et al., 
2010): race (coded as white and all other races), sex, age, 
obesity (indicated by a BMI greater than 30) and total brain 
parenchyma volume (here, represented by total gray matter vol- 
ume). The treatment of confounding variables here is anal- 
ogous to that in the OLS regression framework: They were 
included in all models and never removed, even if they were 
ultimately not significant. Thus, the interpretation of a set of 
selected variables is that they are significantly related to the out- 
come, controlling for confounding variables and all other brain 
regions. 

Influential points 

Before the analysis commenced, potentially influential data points 
were determined by modeling each predictor against each out- 
come individually and calculating externally studentized residuals 
in each case (SAS Institute Inc, 2008). Any observation with a 
residual greater than 2.5 in absolute value was removed from the 
analysis (this value is slightly less conservative than the cut-off of 
2 suggested by the SAS documentation). 

Three observations were removed from the HBP data based 
on the above criterion, while 11 were removed from the CHS. 
In both data sets, influential points were those with a notably 
smaU/large 3MS value paired with a large/small regional volume. 
The only exception was one observation in the HBP data, which 
had a very large total brain volume relative to the other subjects. 
For each data set, there were some subjects with invalid MRIs 
and/or missing covariate values, so that after removing these sub- 
jects and also the influential observations, the final sample size 
for the CHS was n = 286, while n = 302 for the HBR In Table 2, 
li-values for differences in demographic measures between the 
CHS and HBP cohorts were obtained either by a chi-square test, 
two-sample f-test or the Kruskal-Wallis Test when normality was 
suspect. 

Analyses were conducted using R version 2.13.2 (spls pack- 
age 2.1-0) and SAS version 9.2 (SAS Institute Inc, 2008). Both 
the dependent and continuous independent variables were stan- 
dardized, and, unless otherwise mentioned, all other settings were 
kept at default for all functions/procedures used. Run-time for the 
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Table 2 | Demographic and MRI volumetric summaries for 
Cardiovascular Health Study and Healthy Brain Project participants. 

Cardiovascular Healthy brain p-value 





health study 


proj6ct 




Sample size 


n= 286 


n = 302 




Female (n, %) 


177 (62%) 


174 (58%) 


0.33 


White (n, %) 


224 (78%) 


181 (60%) 


<0.001'' 


Obese (n, %) 


46 (16%) 


79 (26%) 


0.004= 


Age Imean (SD)1 


78 (4.0) 


83 (2.8) 


<0.001'' 


SMS score [mean (SD)1 


93.6 (5.2) 


92.9 (6.7) 


0.92 


MRI volumes [mean mm^ (SD)]'' 








Amygdala 


2786 (605) 


2934 (419) 




Anterior cingulate cortex 


10554 (1587) 


9615 (1517) 




Caudate 


7704 (1835) 


8586 (2047) 




Cerebellum 


68220 (24799) 


99264 (12788) 




Dorsolateral prefrontal cortex 


67790 (9706) 


26837 (3240) 




Entorhinal cortex 


3922 (808) 


3702 (664) 




Hippocampus 


9649 (1296) 


9452 (1205) 




Lateral parietal inferior cortex 


10368 (2061) 


10990 (1527) 




Lateral parietal superior cortex 


9719 (2342) 


11787 (1807) 




Medial parietal cortex 


18483 (3408) 


20611 (3031) 




Pallidum (natural logarithm) 


5.49 (1.11) 


5.90 (0.93) 




Parahippocampus 


10144 (1570) 


10663 (1608) 




Parenchyma (total gray matter) 


466482 (66738) 


527997 (55062) 




Post-central gyrus 


15255 (3132) 


18972 (2730) 




Posterior cingulate cortex 


2557 (458) 


2816 (695) 




Pre-central gyrus 


13485 (2579) 


16741 (2538) 




Putamen 


2192 (1946) 


3002 (2443) 




Supplementary motor cortex 


9260 (2168) 


128218 (2220) 




Thalamus 


1872 (1016) 


1856 (460) 





^Significant with a = 0.05. 

^Mean volumes are expected to differ since tfie CHS and HBP used different 
MR scanners, thus p-values are not reported. 



SPLS analyses of interest was less than 5 minutes on a machine 
with the Windows 7 operating system (64 bit) and a 2.16 GHz 
Intel Core 17 processor. 

MULTICOLLINEARITY AND OVER-FiniNG 

MulticoUinearity was assessed using the condition num- 
ber by fitting an OLS regression model that included all 
regions of interest and a priori confounders, where a value 
greater than 100 indicated significant multicoUinearity 
(Belsleyet al, 1980). 

The CHS cohort had a condition number of 190, while the 
HBP group had a value of 227. Since both are notably larger than 
100, multicoUinearity is likely present in these data when all MRI 
regions are considered simultaneously in the same model (Belsley 
et al, 1980). 

While the number of predictors (23) was not larger than 
the sample sizes (297 and 302), various rules of thumb indi- 
cate there should be 10-20 observations for each predic- 
tor in a model (Harrell, 2001). This suggests one should 
have at least 230 observations, and potentially as many as 
460, which could indicate potential over-fitting with these 
data. 



Table 3 | Results from all-possible (first two columns) and traditional 
(last column) SPLS for the Healthy Brain Project. 



Brain 


% Times 


Mean 


Traditional 


region 


chosen with 


non-zero 


SPLS 




all-possible 


P from 


P 




method 


all-possible 








method 




^ 1— 1 inni^i^a mm ic 
1 1 IjjptJL'a 1 1 1 [JUo 


100 


0 276 


0 262 


^ Psrshippocsmpus 


100 


0.258 


0.180 


'^Amygdala 


976 


0.137 


0.065 


Anterior cingulate 


96.6 


0.088 


0.091 


L.UI LCA 








^Entorhinal cortex 


96.1 


—0.279 


—0.159 


'^Medial Parietal 


96.1 


0.148 


0.158 


L,U\ LcA 








'^i innlomGnta r\/ 
iju|j[jit;i 1 itJi iicii y 


96 1 


—0 187 


—0 310 


mntnr f^nrtpv 

1 1 IVJLL/l VjL/I LCA. 








Thq 1 ami IQ 

1 1 [G 1 G 1 1 1 \J O 


95.2 


—0.104 


—0.098 


Dorsolateral 


92.3 


0.071 


0.020 


prefrontal cortex 








LdLcldl pdlltiLdl 


91 8 


—0 163 


n ini 










rdlMUUlM U'dLUIdl 


90 3 


—0 106 


— U. 1 1 o 


irinanthml 
lui^dl IL[ II 1 1/ 








^Lateral parietal 


QQ /i 


U.Uoo 




inferior cortex 








Post-central gyrus 


88.4 


0.056 


0.079 


Putamen 


85.0 


0.031 




Pre-central gyrus 


84.5 


-0.024 




Cerebellum 


84.1 


0.006 


0.012 


''Caudate 


83.6 


-0.021 




''Posterior 


82.6 


-0.033 





cingulate cortex 

Regions are ordered by the percentage of time chosen across all models by 
the all-possible method (first column), with the horizontal lines separating poten- 
tial groupings of predictors with regard to their strength of association with the 
outcome. 

^Primary hypothesized region. 
''Secondary hypothesized region. 

'^Applicable only for those regions chosen by traditional SPLS using optimal 
tuning parameters. 

SPARSE PARTIAL LEAST SQUARES ANALYSIS 

The spls package based on the theory presented by Chun and 
Kele§ (2010) was used for both traditional and all-possible SPLS 
(Tables 3, 4). Horizontal lines show potential empirically-driven 
cut-points that indicate varying levels of association between the 
predictors and outcome. 

Within the HBP data set (Table 3), all-possible SPLS largely 
confirmed the proposed hypotheses by choosing two of the pri- 
mary regions (hippocampus and parahippocampus) 100% of the 
time and the third (entorhinal cortex) in 96.1% of the models. 
Additionally, the three largest average non-zero parameter esti- 
mates from all-possible (second column) were for the three pri- 
mary regions: entorhinal cortex (—0.279), hippocampus (0.276) 
and parahippocampus (0.258). This contrasts traditional SPLS 
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Table 4 | Results from all-possible (first two columns) and traditional 
(last column) SPLS for the Cardiovascular Health Study. 



Brain 


/o 1 II 1 it?a 


/*\vt?i ciyc? 


Tra H ■ t ! n n 9 1 
II ciui Liui lai 


region 


chosen with 


non-zero 


SPLS 




all-possible 


P from 


r 




method 


all-possible 








method 




3 Parah innnf^a mm ic 


100 


0 196 


0 111 


^Hippocampus 


100 


0.141 


0.093 


''Medial parietal 


100 


0.228 


0.054 


cortex 








'■Lateral parietal 


97.6 


0.220 


0.060 


inferior cortex 








PfllliHiim (natiiral 


96.6 


0.131 


0.126 


logarithm) 








Pre-central gyrus 


96.1 


0.196 


0.021 


''Lateral parietal 


94.7 


-0.290 




superior cortex 








Dorsolateral 


Q9 P 

yz.o 


U.U 1 o 


U.UZD 


prefrontal cortex 








Supplementary 


90.8 


-0.128 


-0.110 


motor cortex 








Ml 1 lyyudid 


89 9 


0 053 


0 084 


''Entorhinal cortex 


89.4 


-0.132 




''Posterior 


88.9 


-0.062 




cingulate Cortex 








Thalamus 


88.9 


—0.078 




Post-central gyrus 


870 


-0.070 




Putamen 


84.1 


0.054 




Anterior cingulate 


82.6 


0.005 




cortex 








''Caudate 


79.7 


0.022 




Cerebellum 


79.2 


0.050 





Regions are ordered by the percentage of time chosen across ail modeis by 
the all-possibie method (first coiumnj, with the horizontal lines separating poten- 
tial groupings of predictors with regard to their strength of association with the 
outcome. 



"Primary hypothesized region. 
^Secondary hypothesized region. 

'^Applicable only for those regions chosen by traditional SPLS using optimal 
tuning parameters. 

in that the region with the largest estimated magnitude (third 
column) was the supplementary motor cortex (—0.310), yet this 
was not a hypothesized region. Although chosen a relatively large 
percentage of the time by all-possible (96.1%), this region was 
ranked below/tied with all three primary regions and two sec- 
ondary (amygdala, medial parietal cortex). It also had a smaller 
average estimate (—0.187) than all three primary regions. Thus, 
this region was deemed most significant by traditional, but ranked 
below multiple hypothesized regions by all-possible. 

Traditional SPLS also chose post-central gyrus and cerebellum, 
so that one might conclude these regions are significantly predictive 
of 3MS, yet cerebellum was the third lowest-ranked region by 
all-possible (84.1%), and post-central the sixth lowest (88.4%). 



Lastly, the additional information gained by all-possible SPLS 
(ranking according to percent) indicates the lateral parietal 
inferior cortex is a potentially borderline significant predictor 
(89.4%), which could not have been known based on the tradi- 
tional results, as its parameter estimate was set to zero. 

Despite being secondary regions, neither the caudate nor pos- 
terior cingulate cortex were chosen by either method, so that 
the results were consistent in this way and may indicate a differ- 
ent relationship in a multivariable setting than has been seen in 
previous studies involving individual predictors. 

The CHS results (Table 4) are notably consistent with those 
from the HBP data. Specifically, two primary regions (parahip- 
pocampus, hippocampus) were again chosen in 100% of the 
models, although the third primary region (entorhinal cortex) 
was selected less often, at 89.4%. However, this region had a 
larger average parameter estimate (—0.132) than all other regions 
selected less than 90% of the time, and some regions selected in 
greater than 90% of the models (pallidum, dorsolateral prefrontal 
and supplementary motor cortices, all non-hypothesized). This 
again shows the utility of all-possible SPLS in that it highlighted 
a potentially important, borderline predictor that was missed by 
traditional. 

The regions with the largest average magnitudes according to 
all-possible were the lateral parietal superior (—0.290), medial 
parietal (0.228) and lateral parietal inferior (0.220) cortices 
(all secondary), and the parahippocampus (0.196), a primary 
region, so that the top four largest estimates were associated with 
hypothesized regions. Alternatively, traditional SPLS assigned 
the largest parameter estimate to the pallidum (0.126), fol- 
lowed by the parahippocampus (0.111) and the supplementary 
motor cortex (—0.110), so that two of the three regions with 
the largest estimates according to traditional SPLS were non- 
hypothesized. In contrast, all-possible ranked both the pallidum 
(96.6%) and supplementary motor (90.8%) lower than two 
primary (parahippocampus, hippocampus) and two secondary 
(medial parietal, lateral parietal inferior cortices) regions (and 
also lower than lateral parietal superior in the case of the sup- 
plementary motor cortex). 

Lastly, the posterior cingulate cortex and caudate, despite 
being secondary regions, were not chosen by either method. 
This finding for the caudate is consistent with that from 
the HBP 

DISCUSSION 

The purpose of this study was to illustrate that all-possible 
SPLS provides additional, useful information not attainable by 
traditional SPLS: relative rankings and parameter estimates for 
non-selected predictors. Simulation verified that predictors not 
associated with the outcome are selected less often as sparsity 
increases, while strong, and in most cases weak, associations 
remain robust. Additionally, conducting aU-possible SPLS a large 
number of times showed that, on average, the percentage of time 
chosen and mean non-zero standardized estimates were consis- 
tent with the structure of the simulated data. A real data example 
indicated all-possible SPLS was more successful at highlighting 
hypothesized relationships than traditional SPLS, and also gave 
useful information about borderline variables that could not 
otherwise have been known. 
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Given the CHS and HBP data sets differed with respect to 
neuroimaging protocols and demographics, it is notable that all- 
possible SPLS detected hypothesized associations across these 
cohorts, suggesting robustness in the method. Specifically, MR 
scanners had different field strengths: the CHS MRIs were 
obtained with a 1.5 Tesla, the HBP with a 3.0. Additionally, pro- 
tocols with different spatial resolutions were used across groups: 
the CHS applied a 5.0 mm slice, whereas the HBP applied a 1.5. 
Lastly, the cohorts were significantly different with regard to race, 
obesity and age (although these factors were controlled for in all 
models). Despite these differences between data sets, the method 
yielded consistent results overall, indicating its utility as variable 
selection technique. 

A weakness of all-possible SPLS is its relative nature (i.e., 
ranking by percentage) with no strict cut-off value due to a 
lack of distributional properties. For example, in the simulation 
in Section Percentage of Time Chosen and Average Non-Zero 
Standardized Estimates (Table 1), the average percentage defined 
three distinct groups, but with no insight into significance (or 
lack thereof). However, viewing the predictors in this way allows 
one to see more detail than the dichotomous results of tradi- 
tional SPLS, and to apply a cut-off if desired, where the value 
would be based on empirical experience, rather than guided by 
theory. 

By utilizing simulation and a well-studied predictor-outcome 
relationship across two independent studies, the current findings 
validate this variation of SPLS as a useful technique for selecting 
variables in situations where other approaches (namely, OLS) fail. 
The results of this study suggest all-possible SPLS could be used 
for hypothesis generation without having to restrict the set of 
predictors due to multicoUinearity or a comparatively small sam- 
ple size, which geneticists, neuroscientists, economists and social 
scientists often encounter. The additional information given by 
all-possible SPLS is especially useful in exploratory analyses, as it 
allows for a more thorough understanding of the data than can be 
provided by the binary results of traditional SPLS. 
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